Impact and influence of modern AI in metadata management

Yang, Wenli, Fu, Rui, Amin, Muhammad Bilal, Kang, Byeong

arXiv.org Artificial Intelligence

Metadata management plays a critical role in data governance, resource discovery, and decision-making in the data-driven era. While traditional metadata approaches have primarily focused on organization, classification, and resource reuse, the integration of modern artificial intelligence (AI) technologies has significantly transformed these processes. This paper investigates both traditional and AI-driven metadata approaches by examining open-source solutions, commercial tools, and research initiatives. A comparative analysis of traditional and AI-driven metadata management methods is provided, highlighting existing challenges and their impact on next-generation datasets. The paper also presents an innovative AI-assisted metadata management framework designed to address these challenges. This framework leverages advanced AI technologies to automate metadata generation, enhance governance, and improve the accessibility and usability of modern datasets. Finally, the paper outlines future directions for research and development, proposing opportunities to further advance metadata management in the context of AI-driven innovation and complex datasets.
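The framework's central promise, automated metadata generation, can be pictured with a minimal sketch. The `generate_metadata` function and its output schema below are illustrative assumptions, not the paper's actual framework: given a list of records, it infers the kind of descriptive metadata (per-field value types and completeness) that an AI-assisted pipeline would produce without manual curation.

```python
def generate_metadata(records):
    """Sketch of automated metadata generation: infer per-field value
    types and completeness from a list of record dicts. Illustrative
    only; a real pipeline would add semantic tags, lineage, etc."""
    fields = {k for r in records for k in r}
    meta = {}
    for f in sorted(fields):
        values = [r.get(f) for r in records]          # None where missing
        present = [v for v in values if v is not None]
        meta[f] = {
            "types": sorted({type(v).__name__ for v in present}),
            "completeness": len(present) / len(records),
        }
    return meta
```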


Data-Juicer 2.0: Cloud-Scale Adaptive Data Processing for Foundation Models

Chen, Daoyuan, Huang, Yilun, Pan, Xuchen, Jiang, Nana, Wang, Haibin, Ge, Ce, Chen, Yushuo, Zhang, Wenhao, Ma, Zhijian, Zhang, Yilei, Huang, Jun, Lin, Wei, Li, Yaliang, Ding, Bolin, Zhou, Jingren

arXiv.org Artificial Intelligence

The burgeoning field of foundation models necessitates advanced data processing mechanisms capable of harnessing the vast amounts of valuable, varied-type data utilized by these models. Nevertheless, the current landscape presents unique challenges that traditional data processing frameworks cannot handle effectively, especially with multimodal intricacies. In response, we present Data-Juicer 2.0, a new system offering rich data processing capabilities backed by over a hundred operators spanning various modalities like text, image, audio, and video. With seamless compatibility and dedicated optimizations for popular dataset hubs like Hugging Face and computing engines like Ray, Data-Juicer 2.0 enhances its predecessor in usability, efficiency, and programmability. It features an easily accessible user interface layer that supports decoupled Python interactions, RESTful APIs, and conversational commands. Alongside this, it contains a core runtime layer optimized for adaptive execution and management across different dataset scales, processing demands, and computational environments, while shielding unnecessary system details. Extensive empirical evaluations demonstrate Data-Juicer 2.0's remarkable performance and scalability, highlighting its capability to efficiently process tens of billions of data samples with tens of thousands of CPU cores. The system is publicly available, actively maintained, and broadly adopted in diverse research endeavors, practical applications, and real-world products such as Alibaba Cloud PAI.
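The operator-based design can be pictured with a toy sketch. This is not Data-Juicer's actual API; `run_pipeline` and the `(kind, fn)` operator tuples are invented for illustration of how a sample list flows through a declared chain of map and filter operators.

```python
def run_pipeline(samples, operators):
    """Minimal sketch of an operator-style data pipeline. Each operator
    is a (kind, fn) pair: 'map' transforms every sample, 'filter' keeps
    a sample only when fn returns True. Illustrative only."""
    for kind, fn in operators:
        if kind == "map":
            samples = [fn(s) for s in samples]
        elif kind == "filter":
            samples = [s for s in samples if fn(s)]
        else:
            raise ValueError(f"unknown operator kind: {kind}")
    return samples
```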


AntiLeak-Bench: Preventing Data Contamination by Automatically Constructing Benchmarks with Updated Real-World Knowledge

Wu, Xiaobao, Pan, Liangming, Xie, Yuxi, Zhou, Ruiwen, Zhao, Shuai, Ma, Yubo, Du, Mingzhe, Mao, Rui, Luu, Anh Tuan, Wang, William Yang

arXiv.org Artificial Intelligence

Data contamination hinders fair LLM evaluation by introducing test data into newer models' training sets. Existing studies address this challenge by updating benchmarks with newly collected data. However, they fail to guarantee contamination-free evaluation, as the newly collected data may contain pre-existing knowledge, and their benchmark updates rely on intensive human labor. To address these issues, in this paper we propose AntiLeak-Bench, an automated anti-leakage benchmarking framework. Instead of simply using newly collected data, we construct samples with explicitly new knowledge absent from LLMs' training sets, which thus ensures strictly contamination-free evaluation. We further design a fully automated workflow to build and update our benchmark without human labor. This significantly reduces the cost of benchmark maintenance to accommodate emerging LLMs. Through extensive experiments, we highlight that data contamination likely exists before LLMs' cutoff time and demonstrate that AntiLeak-Bench effectively overcomes this challenge.
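The core idea, building benchmark samples only from knowledge that postdates a model's training cutoff, can be sketched as follows. The fact schema, the `build_antileak_samples` helper, and the question template are hypothetical simplifications of the paper's automated workflow, invented for illustration:

```python
from datetime import date

def build_antileak_samples(facts, model_cutoff):
    """Keep only facts established after the model's training cutoff,
    so every benchmark question requires knowledge that cannot appear
    in the training data. `facts` is a list of dicts with keys
    'subject', 'relation', 'object', and 'since' (a date)."""
    fresh = [f for f in facts if f["since"] > model_cutoff]
    return [
        {"question": f"Who is the {f['relation']} of {f['subject']}?",
         "answer": f["object"]}
        for f in fresh
    ]
```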


Adaptive multiple optimal learning factors for neural network training

Challagundla, Jeshwanth

arXiv.org Artificial Intelligence

The University of Texas at Arlington, 2015. Supervising Professor: Michael Manry. There is always ambiguity in deciding the number of learning factors that is really required for training a Multi-Layer Perceptron. This thesis solves this problem by introducing a new method of adaptively changing the number of learning factors, computed based on the error change created per multiply. A new method is introduced for computing learning factors for weights grouped based on the curvature of the objective function. A method for linearly compressing large ill-conditioned Newton's Hessian matrices to smaller well-conditioned ones is shown. This thesis also shows that the proposed training algorithm adapts itself between two other algorithms in order to produce a better error decrease per multiply. The performance of the proposed algorithm is shown to be better than OWO-MOLF and Levenberg-Marquardt for most of the data sets.
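The idea of separate learning factors for curvature-based weight groups can be sketched under a diagonal-Hessian assumption. This is an illustration of the MOLF-style idea, not the thesis's exact algorithm; `group_learning_factors` and the quantile-style grouping are invented for this sketch.

```python
import numpy as np

def group_learning_factors(grad, hess_diag, n_groups=3):
    """Group weights by curvature (diagonal Hessian entries) and compute
    one learning factor per group: the optimal 1-D Newton step along the
    group's own gradient direction, z = (g.g) / (g.Hg)."""
    order = np.argsort(hess_diag)                 # low- to high-curvature
    groups = np.array_split(order, n_groups)
    factors = np.zeros(n_groups)
    for k, idx in enumerate(groups):
        g, h = grad[idx], hess_diag[idx]
        factors[k] = (g @ g) / max((g * g) @ h, 1e-12)
    return groups, factors
```

On an exactly quadratic loss this per-group factor is the exact line-search minimizer along each group's gradient, so each group's step is guaranteed not to increase the loss.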


Multiple Imputation for Biomedical Data using Monte Carlo Dropout Autoencoders

Miok, Kristian, Nguyen-Doan, Dong, Robnik-Šikonja, Marko, Zaharie, Daniela

arXiv.org Machine Learning

Due to complex experimental settings, missing values are common in biomedical data. To handle this issue, many methods have been proposed, from ignoring incomplete instances to various data imputation approaches. With the recent rise of deep neural networks, the field of missing data imputation has shifted towards modelling the data distribution. This paper presents an approach based on Monte Carlo dropout within (Variational) Autoencoders which not only adapts well to the distribution of the data but also allows generation of new data, tailored to each specific instance. The evaluation shows that the imputation error and predictive similarity can be improved with the proposed approach.
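The mechanism, keeping dropout active at inference and averaging many stochastic forward passes, can be sketched with a toy numpy autoencoder. The weights here are random and untrained (in the paper the autoencoder is of course trained first), so this only illustrates the MC-dropout averaging step; `mc_dropout_impute` and its parameters are assumptions for the sketch.

```python
import numpy as np

rng = np.random.default_rng(0)

def mc_dropout_impute(X, n_passes=50, hidden=8, p_drop=0.2):
    """Impute NaN entries by averaging stochastic forward passes of a
    small dropout autoencoder (random, untrained weights; illustration
    of the MC-dropout mechanism only)."""
    d = X.shape[1]
    W1 = rng.normal(0.0, 0.3, (d, hidden))
    W2 = rng.normal(0.0, 0.3, (hidden, d))
    X0 = np.where(np.isnan(X), np.nanmean(X, axis=0), X)  # initial fill
    passes = []
    for _ in range(n_passes):
        H = np.maximum(X0 @ W1, 0.0)                       # ReLU encoder
        # inverted dropout, kept active at inference (Monte Carlo dropout)
        H *= rng.binomial(1, 1 - p_drop, H.shape) / (1 - p_drop)
        passes.append(H @ W2)                              # linear decoder
    mean = np.mean(passes, axis=0)
    return np.where(np.isnan(X), mean, X)  # fill only the missing cells
```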


Adversarial Fault Tolerant Training for Deep Neural Networks

Duddu, Vasisht, Rao, D. Vijay, Balas, Valentina E.

arXiv.org Machine Learning

Deep Learning Accelerators are prone to faults which manifest in the form of errors in Neural Networks. Fault tolerance in Neural Networks is crucial in real-time safety-critical applications requiring computation for long durations. Neural Networks with high regularisation exhibit superior fault tolerance, however, at the cost of classification accuracy. In view of the difference in functionality, a Neural Network is modelled as two separate networks, i.e., the Feature Extractor with an unsupervised learning objective and the Classifier with a supervised learning objective. The traditional approach of training the entire network using a single supervised learning objective is insufficient to achieve the objectives of the individual components optimally. In this work, a novel multi-criteria objective function is proposed, combining unsupervised training of the Feature Extractor followed by supervised tuning with the Classifier Network. The unsupervised training solves two games simultaneously in the presence of adversary neural networks with objectives conflicting with the Feature Extractor's. The first game minimises the loss in reconstructing the input image for indistinguishability given the features from the Extractor, in the presence of a generative decoder. The second game solves a minimax constraint optimisation for distributional smoothening of the feature space to match a prior distribution, in the presence of a Discriminator network. The resultant strongly regularised Feature Extractor is combined with the Classifier Network for supervised fine-tuning. The proposed Adversarial Fault Tolerant Neural Network Training is scalable to large networks and is independent of the architecture. The evaluation on benchmark datasets FashionMNIST and CIFAR10 indicates that the resultant networks have high accuracy with superior tolerance to stuck-at-"0" faults compared to widely used regularisers.
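The fault model used in the evaluation, stuck-at-"0" faults, can be sketched directly: a random fraction of weights is forced to zero, and accuracy under the faulty weights is then compared across regularisers. The `stuck_at_zero` helper below is an illustrative assumption, not the paper's code:

```python
import numpy as np

rng = np.random.default_rng(1)

def stuck_at_zero(weights, fault_rate):
    """Inject stuck-at-'0' faults: force an (approximately) fault_rate
    fraction of weights to zero, modelling accelerator hardware faults."""
    faulty = weights.copy()
    mask = rng.random(faulty.shape) < fault_rate
    faulty[mask] = 0.0
    return faulty
```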


8-Valent Fuzzy Logic for Iris Recognition and Biometry

Popescu-Bodorin, N., Balas, V. E., Motoc, I. M.

arXiv.org Artificial Intelligence

This paper shows that maintaining logical consistency of an iris recognition system is a matter of finding a suitable partitioning of the input space into enrollable and unenrollable pairs by negotiating user comfort against the safety of the biometric system. In other words, consistent enrollment is mandatory in order to preserve system consistency. A fuzzy 3-valued disambiguated model of iris recognition is proposed and analyzed in terms of completeness, consistency, user comfort and biometric safety. It is also shown here that the fuzzy 3-valued model of iris recognition is hosted by an 8-valued Boolean algebra of modulo 8 integers that represents the computational formalization in which a biometric system (a software agent) can achieve the artificial understanding of iris recognition in a logically consistent manner.
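The 3-valued decision at the heart of the model can be sketched with two thresholds on a match score, partitioning pairs into match, non-match, and ambiguous (unenrollable) cases. The function name and threshold values are illustrative assumptions, not the paper's formalization:

```python
def iris_decision(score, t_reject=0.3, t_accept=0.7):
    """3-valued decision from an iris match score in [0, 1]: the band
    between the thresholds is the ambiguous / unenrollable region that
    the binary accept-reject model cannot represent consistently."""
    if score < t_reject:
        return "non-match"
    if score > t_accept:
        return "match"
    return "ambiguous"
```

Widening the ambiguous band trades user comfort (more pairs need re-enrollment) for system safety, which is the negotiation the abstract describes.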


From Cognitive Binary Logic to Cognitive Intelligent Agents

Popescu-Bodorin, Nicolaie, Balas, Valentina E.

arXiv.org Artificial Intelligence

The relation between self-awareness and intelligence is an open problem these days. Despite the fact that self-awareness is usually related to Emotional Intelligence, this is not the case here. The problem described in this paper is how to model an agent which knows (Cognitive) Binary Logic and which is also able to pass (without any mistake) a certain family of Turing Tests designed to verify its knowledge and its discourse about the modal states of truth corresponding to well-formed formulae within the language of Propositional Binary Logic.